Abstract 

Let A be a static array storing n elements from a totally ordered set. We present a data 
structure of optimal size at most $n \log_2(3+2\sqrt{2}) + o(n)$ bits that allows us to answer the following 
queries on A in constant time, without accessing A: (1) previous smaller value queries, where 
given an index i, we wish to find the first index to the left of i where A is strictly smaller than at 
i, and (2) next smaller value queries, which search to the right of i. As an additional bonus, our 
data structure also answers a third kind of query: given indices i < j, find the position 
of the minimum in A[i..j]. Our data structure has direct consequences for the space-efficient 
storage of suffix trees. 

1 Introduction 

We consider the situation where a static array A[1,n] can be preprocessed such that the following 
three queries can be answered in constant time: previous- and next-smaller-value queries, where 
given a position i in A, one searches for the next position p to the left (or right) of i with A[p] < A[i], 
and range minimum queries, where for two given indices i and j we look for the position of the 
minimum element within the subarray A[i..j]. 

Our work is situated in the field of succinct data structures, where the aim is to store objects 
of size n from a universe of size L(n) in $\lg L(n) \cdot (1 + o(1))$ bits¹, while still being able to perform 
all operations on the data as if they were uncompressed. All succinct data structures work in the 
word-RAM model of computation, where fundamental operations on a contiguous field of w bits 
can be performed in constant time (w is the word size, and we assume $\lg n = O(w)$). 

Succinct data structures can be further classified into indexing and encoding data structures. 
An indexing data structure enhances an object (such as an array) with additional functionality 
(such as queries) and needs access to the object itself, whereas an encoding data structure recodes 
all necessary parts of the data for answering the queries without accessing the object. 

For range minimum queries alone, there is a data structure in the encoding model of asymptotically 
optimal size 2n + o(n) bits that answers queries in constant time [6]. Previous- and 
next-smaller-value queries originate from parallel computing [2]. For all three queries combined, 
the only existing data structure uses 3n + o(n) bits [16]. 

In this short note, we present an encoding data structure of size at most $n \lg(3 + 2\sqrt{2}) + o(n) \approx 
2.54n + o(n)$ bits that answers all three queries in constant time. It is interesting to note 
that although we do not have a closed formula for the exact size of our data structure, we prove 
that it is asymptotically optimal. The reason for this slight oddity is that we are not aware of a 
closed formula for the size L of the universe of objects that we encode; however, we prove that we 
encode them in an asymptotically optimal way. 

*Computer Science Department, Karlsruhe University, johannes.fischer@kit.edu 
¹Throughout this article, lg denotes the binary logarithm. 

Although our data structure is independent of the underlying array A and may have other 
applications, our research is clearly motivated by the compact storage of full-text indices [15]. 
Precisely, we show that our data structure automatically yields the smallest compressed suffix tree 
with constant-time navigation (we refer the reader to Sect. 4 for more details and preliminary work 
on compressed suffix trees). 

The rest of this note is structured as follows. Sect. 2 introduces some notation and known 
results. Sect. 3 presents the core idea of the paper, a combined data structure for RMQs and 
PSV-/NSV-queries. Finally, Sect. 4 describes how that data structure yields improvements in 
compressed suffix trees. 

2 Preliminaries 

For integers i and j, we write $[i, j]$ to denote the set $\{i, i + 1, \ldots, j\}$ and $(i, j)$ to denote 
$\{i + 1, \ldots, j - 1\}$. For a rooted tree T and a node v, we write $T_v$ to denote the subtree of T rooted at v. 

2.1 Queries 

Let A[1,n] be an array of totally ordered objects. For technical reasons, we define $A[0] = -\infty = 
A[n + 1]$ as the "artificial" overall minima of the array. We start by formally defining previous 
smaller values: 

Definition 1. For $1 \le i \le n$, let $PSV_A(i) = \max\{j < i : A[j] < A[i]\}$ denote the previous smaller 
value of position i. 

As mentioned in the introduction, we also consider next smaller values and range minima, for 
completeness formally defined as follows. 

Definition 2. For $1 \le i \le n$, let $NSV_A(i) = \min\{j > i : A[j] < A[i]\}$ denote the next smaller 
value of position i. 

Definition 3. For $1 \le i \le j \le n$, let $RMQ_A(i, j) = \mathrm{argmin}\{A[k] : i \le k \le j\}$ denote a range 
minimum query between positions i and j. If the minimum in the query range is not unique, the 
leftmost (or rightmost) minimum is chosen as a representative. 

In the following, the subscript A from $RMQ_A$ etc. will be omitted if the underlying array A is 
clear from the context. 
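
Definitions 1 to 3 can be turned into a brute-force reference implementation, which is useful as a test oracle for any faster structure. The following Python sketch is only illustrative: the helper names (pad, psv, nsv, rmq) and the leftmost flag are assumptions made here, not part of this note, and every query runs in linear time.

```python
import math

def pad(values):
    """Return A[0..n+1] with the artificial minima A[0] = A[n+1] = -infinity."""
    return [-math.inf] + list(values) + [-math.inf]

def psv(A, i):
    """Previous smaller value (Definition 1): largest j < i with A[j] < A[i]."""
    j = i - 1
    while A[j] >= A[i]:      # the sentinel A[0] guarantees termination
        j -= 1
    return j

def nsv(A, i):
    """Next smaller value (Definition 2): smallest j > i with A[j] < A[i]."""
    j = i + 1
    while A[j] >= A[i]:      # the sentinel A[n+1] guarantees termination
        j += 1
    return j

def rmq(A, i, j, leftmost=True):
    """Position of the minimum in A[i..j] (Definition 3); ties broken as requested."""
    best = i
    for k in range(i + 1, j + 1):
        if A[k] < A[best] or (not leftmost and A[k] == A[best]):
            best = k
    return best

A = pad([3, 1, 4, 1, 5])                    # A[1..5] = 3, 1, 4, 1, 5
print([psv(A, i) for i in range(1, 6)])     # [0, 0, 2, 0, 4]
print([nsv(A, i) for i in range(1, 6)])     # [2, 6, 4, 6, 6]
```

Note that positions 2 and 4 hold equal values, so neither is a smaller value for the other; this strictness is what the rest of the note builds on.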

2.2 LRM-Trees 

LRM-Trees are the basis of our new data structure. They were introduced under this name as an 
internal tool for basic navigational operations in ordinal trees [21], and, under the name of "2d-Min 
Heaps," to encode integer arrays in order to support range minimum queries on them [6]. 

Definition 4 (Sadakane and Navarro [21]; Fischer [6]). The LRM-Tree of A is an ordered labeled 
tree with vertices 0, . . . , n. For $1 \le i \le n$, PSV(i) is the parent node of i. The children are ordered 
in increasing order from left to right. 

We note the following useful properties of the LRM-Tree (observe that we use nodes and array 
indices interchangeably throughout this article): 

Lemma 5 (Fischer [6]). Let T be the LRM-Tree of A. 

1. The node labels correspond to the preorder-numbers of T (counting starts at 0). 

2. Let i be a node in T with children $x_1, \ldots, x_k$. Then $A[i] < A[x_j]$ for all $1 \le j \le k$. 

3. Again, let i be a node in T with children $x_1, \ldots, x_k$. Then $A[x_j] \le A[x_{j-1}]$ for all $1 < j \le k$. 
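
To make Definition 4 and Lemma 5 concrete, the sketch below builds the LRM-Tree as an explicit parent array with a monotone stack and lists the nodes in preorder. This is a pointer-based illustration only (the function names are assumptions made here); it is not the succinct representation discussed later.

```python
import math

def lrm_parents(values):
    """parent[i] = PSV(i) for i = 1..n; the root 0 has parent None."""
    A = [-math.inf] + list(values)       # A[0] is the artificial minimum
    parent = [None] * len(A)
    stack = [0]                          # indices whose values strictly increase
    for i in range(1, len(A)):
        while A[stack[-1]] >= A[i]:      # discard everything not strictly smaller
            stack.pop()
        parent[i] = stack[-1]
        stack.append(i)
    return parent

def children_lists(parent):
    """Children of every node, in increasing (left-to-right) order."""
    kids = [[] for _ in parent]
    for i in range(1, len(parent)):
        kids[parent[i]].append(i)
    return kids

def preorder(kids, root=0):
    """Nodes in preorder; by property 1 of Lemma 5 this is 0, 1, ..., n."""
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(reversed(kids[v]))  # leftmost child is visited first
    return order

parent = lrm_parents([3, 1, 4, 1, 5])
kids = children_lists(parent)
print(parent)          # [None, 0, 0, 2, 0, 4]
print(kids[0])         # [1, 2, 4]
print(preorder(kids))  # [0, 1, 2, 3, 4, 5]
```

In the example, the children 1, 2, 4 of the root carry the values 3, 1, 1, which are non-increasing from left to right as stated in property 3.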

2.3 Succinct Tree Encodings 

A rooted ordered tree on n nodes can be encoded in 2n + o(n) bits in various ways such that it 
still permits (the simulation of) all navigational operations in constant time, such as BPS [13] or 
DFUDS [T]. Of particular importance to this article are methods based on tree covering (TC) 
O [TT\ [5]. They support most navigational operations on trees in constant time, among others 
Root(), Parent(u), FirstChild(u), NextSibling(u), SubtreeSize(u), selecting the i'th child 
(IthChild(u, i)), computing the rank of a child among its siblings (ChildRank(u)), and 
computing lowest common ancestors (LCA(u, v)). Farzan and Munro's approach [5] has the further 
advantage that it can also optimally encode other types of trees, such as those described in the 
following section. 

2.4 Schroder Trees 

The term Schroder Tree is used for various types of rooted ordered trees [22]: trees with no nodes 
of out-degree 1, trees with labeled edges, or trees with labeled nodes. For our purposes, we define 
them as follows. 

Definition 6. A Schroder Tree is a rooted ordered tree, where any node except the first child in a 
list of siblings may be colored red or blue. First children are always colored blue. 

The number of Schroder Trees on n nodes is counted by the little Schroder numbers $S_n$. 
Although we do not have a closed formula for $S_n$, it is known [12] that 
$S_n = c \cdot \rho^n n^{-3/2} (1 + O(n^{-1}))$ for a constant $c > 0$, 
with $\rho := 3 + 2\sqrt{2}$. In particular, $S_n \le \rho^n$. 
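
The growth rate can be checked numerically. The sketch below assumes the indexing $S_1 = S_2 = 1$ (which agrees with counting the trees of Definition 6 by hand for small n) and uses the standard three-term recurrence for the little Schroder numbers; both the function name and the indexing convention are assumptions made for illustration.

```python
import math

def little_schroeder(n_max):
    """Return [S_1, ..., S_{n_max}], assuming S_1 = S_2 = 1 and the recurrence
    (n + 1) * S_{n+1} = 3 * (2n - 1) * S_n - (n - 2) * S_{n-1}."""
    s = [0, 1, 1]                        # s[0] is unused padding
    for n in range(2, n_max):
        s.append((3 * (2 * n - 1) * s[n] - (n - 2) * s[n - 1]) // (n + 1))
    return s[1:n_max + 1]

rho = 3 + 2 * math.sqrt(2)
bits_per_element = math.log2(rho)        # the constant in n * lg(rho)

numbers = little_schroeder(20)
print(numbers[:6])                       # [1, 1, 3, 11, 45, 197]
print(round(bits_per_element, 4))        # 2.5431
print(all(s <= rho ** n for n, s in enumerate(numbers, start=1)))
```

The integer division in the recurrence is exact, so the computation stays in exact integer arithmetic; only the comparison against $\rho^n$ uses floating point.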

3 Data Structure 

In this section, we present the new data structure for answering RMQ/PSV/NSV on an input array A. 
We start by introducing the general ideas behind our data structure, and then show how this data 
structure can be encoded succinctly. 

3.1 Basic Solution 

The LRM-Tree (Def. 4) encodes all information for answering PSV-queries in a natural way, as 
it suffices to move to the parent node of i for answering PSV(i). It also captures all the 
information needed for answering RMQs: 

Lemma 7 (Fischer [6]). For arbitrary nodes i and j in the LRM-Tree of A, $1 \le i < j \le n$, let 
$\ell = LCA(i,j)$. Then if $\ell = i$, RMQ(i,j) is given by i, and otherwise, RMQ(i,j) is given by the child of 
$\ell$ that is on the path from $\ell$ to j. 
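
Lemma 7 can be tried out on an explicit parent array. The sketch below finds the lowest common ancestor by naive pointer chasing, so it takes time proportional to the tree depth rather than constant time; the function names are assumptions made for illustration. With the strict PSV of Definition 1, the position returned is the rightmost minimum of the range.

```python
import math

def lrm_parents(values):
    """parent[i] = PSV(i) for i = 1..n, via a monotone stack; root 0 has parent None."""
    A = [-math.inf] + list(values)
    parent = [None] * len(A)
    stack = [0]
    for i in range(1, len(A)):
        while A[stack[-1]] >= A[i]:
            stack.pop()
        parent[i] = stack[-1]
        stack.append(i)
    return parent

def rmq_via_lca(parent, i, j):
    """Answer RMQ(i, j) for 1 <= i <= j using only the tree (Lemma 7)."""
    ancestors_of_i = set()
    v = i
    while v is not None:                 # collect i and all of its ancestors
        ancestors_of_i.add(v)
        v = parent[v]
    child, v = None, j
    while v not in ancestors_of_i:       # climb from j until we meet that path
        child, v = v, parent[v]
    lca = v
    # If the LCA is i itself, i is the minimum; otherwise it is the child of
    # the LCA through which the path to j passes.
    return i if lca == i else child

parent = lrm_parents([3, 1, 4, 1, 5])
print(rmq_via_lca(parent, 1, 5))   # 4 (rightmost of the two minima at 2 and 4)
print(rmq_via_lca(parent, 2, 3))   # 2
```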

Thus, it remains to show how NSV-queries can be answered. It is easy to see that the LRM-Tree 
alone is not enough for this task: consider A = [0,0] and A' = [1,0]. These arrays have the same 
LRM-Tree (and hence the same answers to all RMQs and PSV-queries); yet, their NSV-queries differ, 
as $NSV_A(1) = 3$, and $NSV_{A'}(1) = 2$. 

In principle, we could build another LRM-Tree on the reversed sequence $A^R$ for answering 
NSV-queries, as $NSV_A(i) = n - PSV_{A^R}(n - i + 1) + 1$. As this would double the space of the resulting 
data structure, we now present a more sophisticated solution. 
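
Both observations are easy to confirm by computation. The following sketch computes all PSV and NSV values with monotone stacks and checks the reversal identity; the function names are assumptions made for illustration, and index 0 of each returned list is unused padding so that positions stay 1-based.

```python
import math

def psv_all(values):
    """P[i] = PSV(i) for i = 1..n (P[0] is unused)."""
    A = [-math.inf] + list(values)
    P = [None] * len(A)
    stack = [0]
    for i in range(1, len(A)):
        while A[stack[-1]] >= A[i]:
            stack.pop()
        P[i] = stack[-1]
        stack.append(i)
    return P

def nsv_all(values):
    """N[i] = NSV(i) for i = 1..n (N[0] is unused), scanning right to left."""
    n = len(values)
    A = [-math.inf] + list(values) + [-math.inf]
    N = [None] * (n + 1)
    stack = [n + 1]
    for i in range(n, 0, -1):
        while A[stack[-1]] >= A[i]:
            stack.pop()
        N[i] = stack[-1]
        stack.append(i)
    return N

def nsv_via_reversal(values):
    """NSV_A(i) = n - PSV_{A^R}(n - i + 1) + 1, using PSV on the reversed array."""
    n = len(values)
    P_rev = psv_all(list(values)[::-1])
    return [None] + [n - P_rev[n - i + 1] + 1 for i in range(1, n + 1)]

print(psv_all([0, 0]) == psv_all([1, 0]))   # True: identical LRM-Trees
print(nsv_all([0, 0]), nsv_all([1, 0]))     # [None, 3, 3] [None, 2, 3]
```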

The general idea of our data structure can be seen as follows. Recall property 3 of Lemma 5: 
the children $x_1, \ldots, x_k$ of a node v in the LRM-Tree are ordered decreasingly from left to right: 
$A[x_1] \ge A[x_2] \ge \cdots \ge A[x_k]$. Now suppose we wish to calculate $NSV(x_i)$ for some $1 \le i \le k - 1$, and 
assume that $A[x_i] > A[x_{i+1}]$. Then $NSV(x_i) = x_{i+1}$, as all A-values in the subtree $T_{x_i}$ are strictly 
greater than at position $x_i$ (property 2 of Lemma 5). If, on the other hand, $A[x_i] = A[x_{i+1}]$, then 
the next "candidate" for $NSV(x_i)$ is $x_{i+2}$ (assuming $i \le k - 2$), as again all A-values in $T_{x_{i+1}}$ are 
strictly greater than $A[x_{i+1}] = A[x_i]$. 

This suggests the following general approach. In the LRM-Tree T of A, a node is colored red if 
the corresponding value in A is smaller than the A-value at its left sibling (if such a sibling exists). 
More formally, let v be a node in T with children $x_1, \ldots, x_k$. Then for all $2 \le i \le k$, node $x_i$ is 
colored red if and only if $A[x_i] < A[x_{i-1}]$. All other nodes (including the root) are colored blue. 
We call the resulting tree the Colored LRM-Tree. 
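
The coloring rule can be sketched directly on top of an explicit LRM-Tree. The representation below (a parent array, child lists, and a boolean red array) and the function name are assumptions made for illustration; the succinct encoding is the subject of Sect. 3.2.

```python
import math

def colored_lrm_tree(values):
    """Return (parent, kids, red) for the Colored LRM-Tree of A[1..n] = values."""
    A = [-math.inf] + list(values)
    n = len(values)
    parent = [None] * (n + 1)
    stack = [0]
    for i in range(1, n + 1):            # parent[i] = PSV(i)
        while A[stack[-1]] >= A[i]:
            stack.pop()
        parent[i] = stack[-1]
        stack.append(i)
    kids = [[] for _ in range(n + 1)]
    for i in range(1, n + 1):
        kids[parent[i]].append(i)
    red = [False] * (n + 1)              # root and first children stay blue
    for sibs in kids:
        for left, right in zip(sibs, sibs[1:]):
            red[right] = A[right] < A[left]
    return parent, kids, red

_, _, red_a = colored_lrm_tree([0, 0])
_, _, red_b = colored_lrm_tree([1, 0])
print(red_a)   # [False, False, False]
print(red_b)   # [False, False, True]
```

The two arrays A = [0,0] and A' = [1,0] from above share their LRM-Tree but now differ in the color of node 2.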

To get the connection to NSV-queries, we need the following definition: 

Definition 8. Let $T^C$ be the Colored LRM-Tree of A[1,n], and let v be a node in $T^C$ with children 
$x_1, \ldots, x_k$. The next red sibling $NRS(x_i)$ of a node $x_i$ is the leftmost sibling to the right of $x_i$ that 
is colored red. If such a sibling does not exist, we define $NRS(x_i) = \bot$. In symbols, let $M = \{i < 
j \le k : x_j \text{ is colored red}\}$. Then $NRS(x_i) = \bot$ if $M = \emptyset$, and otherwise $NRS(x_i) = x_{\min M}$. 

We can then show the following lemma: 

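Lemma 9. Let $T^C$ be the Colored LRM-Tree of A[1,n], and let v be a node in $T^C$ with children 
$x_1, \ldots, x_k$, $x_1 < x_2 < \cdots < x_k$. Then 

$$NSV(x_i) = \begin{cases} NRS(x_i) & \text{if } NRS(x_i) \ne \bot, \\ x_k + \mathrm{SubtreeSize}(x_k) & \text{otherwise.} \end{cases}$$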



Proof. We consider each case in turn. 

$NRS(x_i) \ne \bot$. Let j be defined by $x_j = NRS(x_i)$. From Def. 8 and the fact that node $x_j$ is red, we 
know that $A[x_j] < A[x_i]$. Hence, we need to show that $A[h] \ge A[x_i]$ for all $h \in (x_i, x_j)$. From 
property 1 of Lemma 5 we know that all values in $(x_i, x_j)$ occur in $T_{x_i}, \ldots, T_{x_{j-1}}$. Because j 
is minimal and due to property 3 of Lemma 5, $A[h] = A[x_i]$ for $h = x_{i+1}, \ldots, x_{j-1}$. But due 
to property 2 of Lemma 5, $A[h] > A[x_i]$ for all $h \in [x_a + 1, x_{a+1} - 1]$ and all $i \le a \le j - 1$. 
Hence, $NSV(x_i) = x_j$. 







$NRS(x_i) = \bot$. Let $y = x_k + \mathrm{SubtreeSize}(x_k)$. As above, we can show that $A[h] \ge A[x_i]$ for all 
$x_i < h < y$. It thus remains to show that $A[y] < A[x_i]$. For the sake of contradiction, 
assume that $A[y] \ge A[x_i]$, where we further distinguish between the cases "=" and ">". If 
$A[y] = A[x_i]$, then PSV(y) = v (the parent node of $x_i$), so y is the right sibling of $x_k$, a 
contradiction to the definition of $x_k$. If $A[y] > A[x_i] = A[x_k]$, then again due to property 2 
of Lemma 5 we have $PSV(y) \in [x_k, y - 1]$. So $T_{x_k}$ contains y, a contradiction to the size of 
$T_{x_k}$, which is $y - x_k$, as $T_{x_k}$ contains exactly those elements from $[x_k, y)$. 

□ 
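
The statement just proved translates into a short procedure that answers NSV from the colored tree alone, without looking at A after construction. The sketch below uses explicit lists and a linear scan over the siblings, so it illustrates the logic of Lemma 9 but not the constant-time bound; all names are assumptions made for illustration.

```python
import math

def colored_lrm_tree(values):
    """Return (parent, kids, red) for the Colored LRM-Tree of A[1..n] = values."""
    A = [-math.inf] + list(values)
    n = len(values)
    parent, stack = [None] * (n + 1), [0]
    for i in range(1, n + 1):
        while A[stack[-1]] >= A[i]:
            stack.pop()
        parent[i] = stack[-1]
        stack.append(i)
    kids = [[] for _ in range(n + 1)]
    for i in range(1, n + 1):
        kids[parent[i]].append(i)
    red = [False] * (n + 1)
    for sibs in kids:
        for left, right in zip(sibs, sibs[1:]):
            red[right] = A[right] < A[left]
    return parent, kids, red

def subtree_sizes(parent):
    """size[v] = number of nodes in the subtree rooted at v (v included)."""
    size = [1] * len(parent)
    for i in range(len(parent) - 1, 0, -1):   # children have larger labels
        size[parent[i]] += size[i]
    return size

def nsv_from_tree(parent, kids, red, size, x):
    """NSV(x) via the next red sibling, or via the subtree of the last sibling."""
    sibs = kids[parent[x]]
    for y in sibs[sibs.index(x) + 1:]:
        if red[y]:
            return y                          # NRS(x) exists
    last = sibs[-1]
    return last + size[last]                  # NRS(x) is undefined

parent, kids, red = colored_lrm_tree([3, 1, 4, 1, 5])
size = subtree_sizes(parent)
print([nsv_from_tree(parent, kids, red, size, x) for x in range(1, 6)])
# [2, 6, 4, 6, 6]
```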

3.2 Succinct Encoding 

We represent the Colored LRM-Tree from Sect. 3.1 similar to Farzan and Munro's succinct 
TC-encoding for ordinal trees [5]. This approach is based on a two-level decomposition of the tree 
into mini- and micro-trees. In our scenario, the encoding of a micro-tree is simply its index in an 
enumeration of all Schroder Trees of the micro-tree size (called "enumeration code" in [5]). In total, 
this uses optimal $\lg S_n + o(n)$ bits of space. 

It remains to show how we implement the query algorithms for RMQ, PSV, and NSV. 

As PSV(i) = Parent(i) and the parent-operation is directly supported by TC, we can directly 
focus on NSV. Recall Lemma 9: given i, we need to find NRS(i) in order to answer NSV(i). The 
NRS-method can be implemented as the combination of modified IthChild- and ChildRank-operations, 
as they are described by Farzan and Munro [5] (see [H p. 23] for further details). In 
particular, given node i, we find the parent p of i, and then determine the rank r of i among all 
its red siblings, from where we select the (r+1)'st red node. To this end, if p is a root of a mini- or 
micro-tree, we use a modified fully indexable dictionary (FID) [18] to rank/select among the red 
nodes. These FIDs are similar to the ones already stored at each mini- or micro-tree root, with the 
difference that they index only the red nodes. Similar to the original analysis, their overall space 
contributes only o(n) bits to the final space. If, on the other hand, p is not a mini- or micro-tree 
root, we use the lookup-tables stored along with the micro-trees to rank/select among the red nodes. 
These lookup-tables also use only o(n) bits, as we use micro-trees of size $O(\log_\rho n / 4)$. Finally, if 
$NRS(i) = \bot$, we move to the rightmost sibling j of i and count the subtree size at j; both operations 
are supported in O(1) time by TC. 

For implementing RMQ(i,j), we have to show how the operations in Lemma 7 can be performed 
in constant time. We cannot resort to the method described by Fischer [6], as it is inherently 
connected to DFUDS. We thus do the following: first compute $\ell = LCA(i,j)$; this is supported by 
TC [21 E]. Then if $\ell \ne i$ (otherwise we return i), compute the depth d of $\ell$ (depth is supported by 
TC). Finally, compute the child of $\ell$ that is on the path to j by a level-ancestor query LAQ(j, d + 1) 
(supported by TC); this is the answer. 

Theorem 1. For an array of n totally ordered objects, there is a data structure using $\lg S_n + o(n) \le 
n \lg(3 + 2\sqrt{2}) + o(n) \approx 2.54n + o(n)$ bits of space that supports RMQs, PSV- and NSV-queries on A in 
O(1) time, without accessing A at query time. 

3.3 Optimality 

It is easy to see that the encoding from Sect. 3.2 is optimal. Given any data structure supporting 
PSV and NSV on some underlying input array A, we can reconstruct the Colored LRM-Tree $T^C$ of 
A, without knowing A: We first create $T^C$'s rightmost path $n = x_1, x_2, \ldots, x_k = 0$ in a bottom-up 
manner, by successively querying $x_{i+1} = PSV(x_i)$, until arriving at $x_k = 0$. All nodes are 
initially colored blue. This leaves us with unprocessed intervals $[x_{i+1} + 1, x_i - 1]$, which are handled 
recursively. During these recursive calls, suppose that a query PSV(v) brings us to a node u which 
is already present in the (partial) LRM-Tree $T^C$. Let w be the smallest child of u greater than 
v (i.e., the leftmost child of u to the right of v). We then check if NSV(v) = w, in which case we 
color w red. Otherwise (NSV(v) > w), w remains blue, as in this case A[v] = A[w]. This procedure 
correctly reconstructs the Colored LRM-Tree of A. 

As every Schroder Tree is also a Colored LRM-Tree for some array A (starting at the root with 
children $x_1, \ldots, x_k$, set $A[x_k]$ to 0, and $A[x_{i-1}]$ to $A[x_i]$ or $A[x_i] + 1$, depending on whether $x_i$ is 
colored blue or red; the unprocessed intervals are handled recursively), we need at least $\lg S_n$ bits 
to encode $T^C$ in the worst case. This proves the optimality of the data structure from Thm. 1. 
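
The reconstruction argument can be exercised on small inputs. The sketch below is a simplified, non-recursive variant of the procedure described above (it queries PSV for every position and then compares NSV of each node with its right sibling), and it counts how many distinct colored trees arise over all short arrays. The function names and the brute-force oracles are assumptions made for illustration.

```python
import math
from itertools import product

def psv_nsv_oracles(values):
    """Brute-force PSV/NSV oracles standing in for an arbitrary data structure."""
    A = [-math.inf] + list(values) + [-math.inf]
    def psv(i):
        return max(j for j in range(i) if A[j] < A[i])
    def nsv(i):
        return min(j for j in range(i + 1, len(A)) if A[j] < A[i])
    return psv, nsv

def reconstruct_colored_tree(n, psv, nsv):
    """Rebuild (parent, red) of the Colored LRM-Tree from queries only."""
    parent = [None] + [psv(i) for i in range(1, n + 1)]
    kids = [[] for _ in range(n + 1)]
    for i in range(1, n + 1):
        kids[parent[i]].append(i)
    red = [False] * (n + 1)
    for sibs in kids:
        for left, right in zip(sibs, sibs[1:]):
            # A strictly smaller right sibling is exactly NSV of its left sibling.
            red[right] = (nsv(left) == right)
    return tuple(parent), tuple(red)

def count_distinct_trees(n):
    """Number of distinct Colored LRM-Trees over all arrays of length n."""
    seen = set()
    for values in product(range(n), repeat=n):   # n distinct values suffice
        psv, nsv = psv_nsv_oracles(values)
        seen.add(reconstruct_colored_tree(n, psv, nsv))
    return len(seen)

print([count_distinct_trees(n) for n in range(1, 5)])   # [1, 3, 11, 45]
```

The counts 1, 3, 11, 45 are the first little Schroder numbers, in line with the claim that the colored trees of arrays and the Schroder Trees (here on n + 1 nodes, since the artificial root is included) are the same objects.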

4 Application to Compressed Suffix Trees 

The result from Thm. [1] has direct consequences for compressed suffix trees (CSTs). A suffix tree 
(ST) for a string S of length n is a compact trie storing all the suffixes of S, in the sense that the 
characters on any root-to-leaf path spell out exactly a suffix. The ST is an extremely important data 
structure with applications in exact or approximate string matching, bioinformatics, and document 
retrieval, to mention only a few examples. 

The drawback of STs is their huge space consumption of 20-40 times the text size (O(nlgn) 
bits in theory), even when using carefully engineered implementations. To reduce their size, in 
recent years several authors provided compressed variants of STs [HI [201 [13 E [13 S [HI [7j . 

We regard the CST as an abstract data type supporting the following operations (apart from 
the usual navigational operations on trees as those mentioned in Sect. 2.3): LeafCount(u) gives 
the number of leaves (suffixes) below u, LeafLabel(u) for a leaf u yields the position in S where 
the corresponding suffix begins, StringDepth(u) gives u's string-depth (number of characters on 
the root-to-u path), SuffixLink(u) gives the unique node v with root-to-v label $\alpha \in \Sigma^*$ if the 
root-to-u label is $a\alpha$ for some $a \in \Sigma$, and Child(u, a) gives the child v of u such that the label on 
the edge (u,v) starts with $a \in \Sigma$. Here and in the following, $\Sigma$ denotes the underlying alphabet of 
size $\sigma$. See the first column of Tbl. 1 for all operations (level ancestor queries are excluded as we 
are not aware of any actual algorithm that needs them in a suffix tree). 

A CST on S can be divided into three components: (1) the suffix array SA, specifying the 
lexicographic order of S's suffixes, defined by $S_{SA[1]..n} < S_{SA[2]..n} < \cdots < S_{SA[n]..n}$ (hence SA 
captures information on the leaves); (2) the LCP-array LCP, storing the lengths of the longest 
common prefixes of lexicographically adjacent suffixes: LCP[1] = −1 and for $2 \le i \le n$, $LCP[i] = 
\max\{k \ge 0 : S_{SA[i]..SA[i]+k-1} = S_{SA[i-1]..SA[i-1]+k-1}\}$, which is the string-depth of the LCA of the 
lexicographically i'th and (i−1)'st suffix (hence LCP captures information on internal nodes); and 
(3) additional data structures for simulating the navigational operations. The goal of a CST is to 
compress each of these three components. 
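
For readers who want to experiment, the first two components can be computed naively from their definitions. The sketch below sorts the suffixes directly and compares neighbors character by character, which is far from the compressed representations discussed here; the function names are assumptions made for illustration, and the returned Python lists hold SA[1..n] and LCP[1..n] at list indices 0..n−1.

```python
def suffix_array(s):
    """1-based starting positions of the suffixes of s in lexicographic order."""
    return [i + 1 for i in sorted(range(len(s)), key=lambda i: s[i:])]

def lcp_array(s, sa):
    """LCP[1] = -1; LCP[i] = longest common prefix of suffixes SA[i-1] and SA[i]."""
    lcp = [-1]
    for prev, cur in zip(sa, sa[1:]):
        a, b = s[prev - 1:], s[cur - 1:]
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        lcp.append(k)
    return lcp

text = "banana"
sa = suffix_array(text)
print(sa)                    # [6, 4, 2, 1, 5, 3]
print(lcp_array(text, sa))   # [-1, 1, 3, 0, 0, 2]
```

The PSV, NSV, and RMQ structure from Sect. 3 would then be built over the LCP values.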

We do not discuss here the different time/space tradeoffs for compressing SA and LCP; we just 
mention that both can be compressed into space proportional to the entropy of the underlying text, 
at the cost of increased access times, which we denote by $t_{SA}$ and $t_{LCP}$, respectively. 

Of more interest to us is the fact that most recent CSTs [HI [17] [16] represent a node v as an 
interval $[v_l : v_r]$ in LCP and base their navigation on RMQs and PSV-/NSV-queries in LCP. There are 
two basic strategies for supporting these queries: we can either use structures of size o(n) [HI [3] or 
2n + o(n) [17] bits and substitute "missing information" by a (sub-)logarithmic number of lookups 
to LCP (indexing model), resulting in increased navigation time (see the 3rd and 5th column of Tbl. 1). 
The other option [16] is to use a data structure that computes RMQ/PSV/NSV without needing access 
to the underlying LCP-array (encoding model). 

Table 1: Comparison of different CSTs (space in bits on top of SA and LCP). The O(·) is omitted 
in all operations. Trees [IHl El [IT] are incomparable to our approach, as they use less space in 
exchange for higher navigation times. The time to compute the position of SA[·] + 1 in 
SA is O(1) in most compressed suffix arrays. 

Given these observations, the index from Thm. 1 almost directly yields a CST with ≈ 2.54n + 
o(n) bits on top of SA and LCP with constant-time support of all operations that do not necessarily 
need access to SA or LCP. See again Tbl. 1 for a comparison. In particular, we get the smallest CST 
with constant-time navigation. Note that it is of utmost theoretical and practical importance to 
have the smallest possible data structure for the navigational component of a CST, as its O(n)-term 
is incompressible, whereas the space of the other two components of a CST (SA and LCP) vanishes 
if the entropy of the underlying text does. 

All suffix tree operations (apart from LeafCount, StringDepth, and Child) from Tbl. 1 
can be implemented solely by performing RMQs and PSV-/NSV-queries in LCP, see [HI [16]. Only the 
implementation of the NextSibling-operation relies on structures that are proprietary to [16] (and 
the one in [8] accesses LCP); we therefore give our own implementation as follows: let $v = [v_l : v_r]$ be 
the node whose next sibling we want to compute. First check if v equals the root, and return null in 
this case. Otherwise, compute $w = [w_l : w_r] = Parent(v)$. If $v_r = w_r$, return null, as v does not 
have a right sibling in this case. We now know that $v_r + 1$ is the leftmost index of NextSibling(v). 
To determine the rightmost index, check if $NSV(v_r + 1) = w_r + 1$, and return $[v_r + 1, w_r]$ in this 
case, as then v is the second-to-last child of w. Otherwise, return $[v_r + 1, RMQ(v_r + 2, w_r) - 1]$, as 
the range minimum query returns a position in LCP where the string-depth of w is stored. 

Theorem 2. Let S be a text of size n with characters from an alphabet of size $\sigma$. Given S's suffix 
array with access time $t_{SA}$ and its LCP-array with access time $t_{LCP}$, there is a CST with additional 
$\lg S_n + o(n) < 2.54n + o(n)$ bits that supports the operations as indicated in the last column of 
Tbl. 1. 



□ 



Our CST resides in between [17] and [16]: it is smaller than [16] and larger than [17], but as 
fast as the larger of these [16]. 

It is interesting to note that our $\lg S_n \approx 2.54n$ bits are also optimal for encoding the topology of 
a suffix tree, as it is a tree with exactly n leaves and no nodes of out-degree 1; the number of such 
trees is also counted by the little Schroder number $S_n$. However, we cannot make an optimality 
claim for the CST from Thm. 2, as it builds on SA and LCP, which already capture the topology of 
the suffix tree. 